We will begin by running a simple linear model that regresses weekly
sales onto the Consumer Price Index (CPI).
# Specifying our model type and setting the computational engine
linear_model <-
linear_reg() %>%
set_engine("lm")
# Fitting the model
fit_cpi <-
linear_model %>%
fit(weekly_sales ~ cpi, data = dfw)
# Model output
summary(fit_cpi$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ cpi, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -662386 -318443 -73868 258442 2095880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 827280.5 21778.4 37.986 < 2e-16 ***
## cpi -732.7 123.7 -5.923 3.33e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 390600 on 6433 degrees of freedom
## Multiple R-squared: 0.005423, Adjusted R-squared: 0.005269
## F-statistic: 35.08 on 1 and 6433 DF, p-value: 3.332e-09
In this model, the intercept tells us that a Walmart store can expect
weekly sales of ~$827,280 when CPI is (a theoretical) 0. We also observe
that the relationship between Weekly_Sales and CPI is negative. That is,
a one-unit increase in CPI is associated with a decrease of ~$733 in
weekly sales, and a one-unit decrease with an increase of ~$733.
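These interpretations can be read directly from the fitted object; a quick sketch using the model fit above (`coef()` indexing by name is just one convenient way to pull the slope):

```r
# Coefficients as a tidy tibble (term, estimate, std.error, statistic, p.value)
tidy(fit_cpi$fit)

# The cpi estimate is the expected change in weekly sales
# for a one-unit change in CPI (~ -733 USD here)
coef(fit_cpi$fit)[["cpi"]]
```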
In evaluating the model statistics, we see an Adjusted R-Squared
value of 0.005269. In other words, this model explains only roughly 0.5%
of the variance in Walmart’s weekly sales. So, while our interpretation
of the effect of CPI on Weekly_Sales is still valid, we must
conclude that this model fails to explain the variance in our target
variable.
# Now we will plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.
plot_store_10 <-
dfw %>%
filter(store == 10) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_10)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_11 <-
dfw %>%
filter(store == 11) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_11)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_12 <-
dfw %>%
filter(store == 12) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_12)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_13 <-
dfw %>%
filter(store == 13) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_13)
## `geom_smooth()` using formula = 'y ~ x'
# A plot to demonstrate how the relationship between CPI and weekly sales varies
# by region/store. Note that the slope of the smoothed line is negative in some
# locales and positive in others.
animated_plot <-
dfw %>%
filter(store %in% c(11:15)) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal() +
gganimate::transition_states(store, transition_length = 1, state_length = 2) +
gganimate::view_follow()
## filter: removed 5,720 rows (89%), 715 rows remaining
animated_plot
## `geom_smooth()` using formula = 'y ~ x'

What we observe here is that the impact of CPI can vary greatly by
store/region. This is still consistent with our evaluation of fit_cpi:
that model explained only a small amount (~0.5%) of the variance in
Weekly_Sales, so we would expect to see these kinds of swings. With a
(much) higher Adjusted R-Squared, such variation would be surprising.
# Fitting a separate linear model per store and extracting the CPI coefficient
dfw %>%
group_by(store) %>%
group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
filter(term == "cpi")
## group_by: one grouping variable (store)
## filter (grouped): removed 315 rows (88%), 45 rows remaining
# Filtering for 2012 and plotting CPI against Weekly_Sales
plot <- dfw %>%
filter(lubridate::year(date) == 2012) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 4,500 rows (70%), 1,935 rows remaining
plotly::ggplotly(plot)
## `geom_smooth()` using formula = 'y ~ x'
We see an interesting effect when we filter for one specific year.
The clusters are nearly vertical because CPI is calculated
geographically, using either Core Based Statistical Areas (CBSAs) or
Metropolitan Statistical Areas (MSAs). CPI may be the same across a
particular region, but different stores in that region have different
sales volumes, hence the vertical clusters.
# Now let's look exclusively at store 10
plot_store_cpi <- dfw %>%
filter(store == 10, lubridate::year(date) == 2012) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
## filter: removed 6,392 rows (99%), 43 rows remaining
plotly::ggplotly(plot_store_cpi)
## `geom_smooth()` using formula = 'y ~ x'
Although CPI varies by region, its variation over time within a
single region tends to be much lower, which is why we see such a slim
range here. Since CPI is a measure of inflation, these regional effects
are expected.
# A new iteration of the previous model that also includes store Size as an independent variable
fit_cpi_size <-
linear_model %>%
fit(weekly_sales ~ cpi + size, data = dfw)
summary(fit_cpi_size$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ cpi + size, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -563750 -167145 -29612 112172 1912650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.828e+05 1.497e+04 12.216 <2e-16 ***
## cpi -6.570e+02 7.692e+01 -8.542 <2e-16 ***
## size 4.847e+00 4.796e-02 101.048 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 242800 on 6432 degrees of freedom
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.6155
## F-statistic: 5151 on 2 and 6432 DF, p-value: < 2.2e-16
# Comparing fit_cpi to fit_cpi_size to see which is better at explaining the variance in Weekly_Sales
anova(fit_cpi$fit, fit_cpi_size$fit)
Note also that the CPI coefficient in the revised model has shrunk in
magnitude, from ~$733 to ~$657. This is because size now explains some
of the variance that was left unexplained by the previous model, which
included only CPI.
# Building a model that uses all variables EXCEPT Date and Store
fit_full <-
linear_model %>%
fit(weekly_sales ~ . - store - date, data = dfw)
summary(fit_full$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -557148 -165608 -24125 112851 1918479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.133e+05 3.546e+04 8.834 < 2e-16 ***
## isholidayTRUE 6.012e+04 1.196e+04 5.026 5.14e-07 ***
## temperature 1.002e+03 1.739e+02 5.761 8.72e-09 ***
## fuel_price -1.333e+04 6.822e+03 -1.954 0.0507 .
## cpi -9.461e+02 8.445e+01 -11.203 < 2e-16 ***
## unemployment -1.252e+04 1.725e+03 -7.258 4.40e-13 ***
## size 4.840e+00 4.802e-02 100.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 241200 on 6428 degrees of freedom
## Multiple R-squared: 0.621, Adjusted R-squared: 0.6206
## F-statistic: 1755 on 6 and 6428 DF, p-value: < 2.2e-16
anova(fit_cpi_size$fit, fit_full$fit)
We observe a further, though slight, improvement in the Adjusted
R-Squared value in the new model (fit_full), which excludes the store
and date identifiers. The ANOVA test also confirms that the improvement
in explanatory power is statistically significant.
More Linear Regression
We hypothesize that the effect of good weather is stronger on
holidays. We can test this by revising fit_full to include an
interaction term.
fit_full_int <-
linear_model %>%
fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)
summary(fit_full_int$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + isholiday *
## temperature, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -557499 -165415 -24493 112914 1918376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.148e+05 3.565e+04 8.830 < 2e-16 ***
## isholidayTRUE 4.745e+04 3.265e+04 1.453 0.1462
## temperature 9.809e+02 1.808e+02 5.424 6.04e-08 ***
## fuel_price -1.342e+04 6.826e+03 -1.966 0.0493 *
## cpi -9.460e+02 8.446e+01 -11.200 < 2e-16 ***
## unemployment -1.251e+04 1.725e+03 -7.254 4.53e-13 ***
## size 4.840e+00 4.802e-02 100.779 < 2e-16 ***
## isholidayTRUE:temperature 2.473e+02 5.932e+02 0.417 0.6768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 241200 on 6427 degrees of freedom
## Multiple R-squared: 0.621, Adjusted R-squared: 0.6206
## F-statistic: 1504 on 7 and 6427 DF, p-value: < 2.2e-16
anova(fit_full$fit, fit_full_int$fit)
Although the interaction coefficient in fit_full_int is positive,
hinting that the effect of warm weather may be somewhat larger on
holidays, it is not statistically significant (p ≈ 0.68), and the ANOVA
test shows no statistically significant improvement. We cannot assert
that the model with the interaction term is an improvement.
We’ll also test whether the effect of temperature on weekly sales is
linear by adding a squared temperature term.
fit_full_sq <-
linear_model %>%
fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)
summary(fit_full_sq$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + I(temperature^2),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -561455 -165260 -24674 112058 1911166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.610e+05 4.111e+04 6.350 2.30e-10 ***
## isholidayTRUE 6.230e+04 1.199e+04 5.197 2.09e-07 ***
## temperature 3.294e+03 9.301e+02 3.542 0.0004 ***
## fuel_price -1.471e+04 6.841e+03 -2.151 0.0315 *
## cpi -9.547e+02 8.449e+01 -11.300 < 2e-16 ***
## unemployment -1.253e+04 1.724e+03 -7.268 4.09e-13 ***
## size 4.831e+00 4.811e-02 100.420 < 2e-16 ***
## I(temperature^2) -1.982e+01 7.901e+00 -2.509 0.0121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 241100 on 6427 degrees of freedom
## Multiple R-squared: 0.6214, Adjusted R-squared: 0.621
## F-statistic: 1507 on 7 and 6427 DF, p-value: < 2.2e-16
anova(fit_full$fit, fit_full_sq$fit)
# Plotting the relationship between Temperature^2 and Weekly Sales
dfw %>%
ggplot(aes(x = temperature, y = weekly_sales)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2))

The model output demonstrates a curvilinear, inverted U-shaped
relationship (visualized above). People are less likely to shop retail
on a freezing cold day. Rising temperatures are associated with
increased sales, but only to a point. As temperatures become excessive
and dangerous, sales start to decrease.
If we were managing Walmart’s promotions, we could offer larger
discounts when the weather is at either extreme and perhaps even
increase the price of certain products when the temperature is
mild.
Predictive Analytics
Now that we have a fairly robust model, we will use it to make
predictions of weekly sales revenue.
# Setting seed for reproducibility
set.seed(3.14159)
# Splitting the data set into a training dataset (75%) and a test dataset (25%)
dfw_split <- initial_split(dfw)
dfw_train <- training(dfw_split)
dfw_test <- testing(dfw_split)
# Fitting the model
fit_org <-
linear_model %>%
fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)
summary(fit_org$fit)
##
## Call:
## stats::lm(formula = weekly_sales ~ . - date - store + I(temperature^2),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -557260 -165114 -25112 115048 1913671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.546e+05 4.725e+04 5.389 7.42e-08 ***
## isholidayTRUE 6.038e+04 1.397e+04 4.323 1.57e-05 ***
## temperature 3.056e+03 1.068e+03 2.861 0.00424 **
## fuel_price -1.939e+04 7.819e+03 -2.480 0.01316 *
## cpi -9.217e+02 9.640e+01 -9.561 < 2e-16 ***
## unemployment -1.058e+04 1.992e+03 -5.312 1.14e-07 ***
## size 4.826e+00 5.496e-02 87.809 < 2e-16 ***
## I(temperature^2) -1.628e+01 9.058e+00 -1.797 0.07237 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239000 on 4818 degrees of freedom
## Multiple R-squared: 0.6248, Adjusted R-squared: 0.6242
## F-statistic: 1146 on 7 and 4818 DF, p-value: < 2.2e-16
# The linear regression output as a tibble
tidy(fit_org)
# Creating a new dataframe with predicted values
results_org <-
predict(fit_org, new_data = dfw_test) %>%
bind_cols(dfw_test) %>%
rename(Predicted_Sales = .pred)
## rename: renamed one variable (Predicted_Sales)
results_org %>%
arrange(date)
# Defining the metric set we will be working with to evaluate the models
perf_metrics <- metric_set(rmse, mae)
# Calculating the performance of fit
perf_metrics(results_org, truth = weekly_sales, estimate = Predicted_Sales)
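The code for fit_nosq, discussed next, is not shown above; a minimal sketch of how it could be fit and evaluated, assuming (based on the discussion) that it is simply fit_org without the squared temperature term:

```r
# Refitting without the squared temperature term (sketch of fit_nosq)
fit_nosq <-
linear_model %>%
fit(weekly_sales ~ . - date - store, data = dfw_train)

# Predicting on the test set and computing the same performance metrics
results_nosq <-
predict(fit_nosq, new_data = dfw_test) %>%
bind_cols(dfw_test) %>%
rename(Predicted_Sales = .pred)

perf_metrics(results_nosq, truth = weekly_sales, estimate = Predicted_Sales)
```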
When we remove the squared temperature term (fit_nosq), something
notable occurs. First, our Adjusted R-Squared value diminishes slightly
(from 62.2% to 62.1%), making it a slightly less appealing model in
terms of explaining the variance in weekly sales. However, we also
observe that the error has been reduced, making fit_nosq relatively
superior in terms of predictive capability. Since we are trying to build
a reliable predictive model, we exclude the squared term and conclude
that fit_nosq is better for that purpose.
More Predictive Modeling
We are fairly pleased with both the explanatory and predictive power
of fit_nosq, but of course we would like to improve upon both metrics.
One issue that we have not yet discussed is the variability in weekly
sales across Walmart locations, as shown below.
# Calculating total weekly sales per store
sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)
# A bar chart showing the distribution of weekly sales revenue by store
ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
geom_bar(stat = "identity", fill = "#0078D4") +
labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

This final model uses a log-linear regression to explain weekly
Walmart sales using store-level, economic, and seasonal predictors.
Applying a log transformation to weekly sales substantially improves
model performance, yielding an adjusted R² of 0.71, and stabilizes
variance across stores with vastly different revenue scales. Overall
model fit is strong, with well-behaved residuals and a highly
significant F-statistic, indicating that the included predictors jointly
explain a meaningful share of sales variation.
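The code for this log-linear specification is not shown above; a sketch of how it could be fit, assuming the same predictors as fit_full and strictly positive weekly sales (the exact formula is an assumption):

```r
# Log-linear model: log-transforming the target to stabilize variance
# (sketch; the exact specification is assumed, not shown in the source)
fit_log <-
linear_model %>%
fit(log(weekly_sales) ~ . - store - date, data = dfw)

summary(fit_log$fit)

# On the log scale, a dummy coefficient b (e.g. isholidayTRUE) implies an
# approximate (exp(b) - 1) * 100 percent change in weekly sales,
# which is where the 6-7% holiday effect reading comes from
```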
Results show that store size is the dominant driver of weekly sales,
dwarfing macroeconomic effects and confirming that physical scale
largely determines revenue potential. Holiday weeks are associated with
an average 6–7% increase in sales, validating the importance of seasonal
demand spikes. Inflation, proxied by CPI, has a small but statistically
significant negative relationship with sales, even after controlling for
store characteristics. Temperature and unemployment exhibit modest
effects consistent with economic intuition, while fuel prices do not
appear to meaningfully impact sales once other factors are accounted
for.
This specification represents the best balance between
interpretability, explanatory power, and robustness among the models
tested. The log transformation enables clear percentage-based
interpretations while materially improving fit relative to linear
alternatives, making the model suitable for both analytical insight and
downstream forecasting.
Limitations
- The model does not explicitly account for store-level fixed
effects or regional hierarchies, which may mask persistent
location-specific dynamics.
- Temporal structure is handled implicitly; autocorrelation and
seasonality are not directly modeled.
- The analysis assumes linear relationships on the log scale and may
understate nonlinear or interaction effects beyond those tested.
- CPI and unemployment are measured at broader geographic levels and
may not fully capture local economic conditions.
Next Steps
- Implement mixed-effects (hierarchical) models to capture
store-specific variation.
- Explore time-series approaches for improved short-term
forecasting.